2. Backend Scaling and Performance Engineering Part-2
In the previous video, we were talking about different types of scaling, and the last thing we discussed was horizontal scaling; this video is the continuation of that one. So if you are watching this for the first time, please watch the first part to have better context. We'll just jump ahead and continue from where we left off. I divided this video into two parts because of a frame corruption issue in my recording software; splitting videos into part one and part two is not my
intention, but I had to do it for this one. Anyway, continuing from where we left off: a very important property when we are talking about horizontal scaling is statelessness. The key enabler, the key property that makes horizontal scaling possible, is statelessness. And what do we mean by statelessness? Remember what horizontal scaling means: we have our backend application and we run multiple instances of it, adding different machines of the same capacity
instead of increasing the capacity of a single machine. So for horizontal scaling we add multiple instances of our backend, meaning multiple servers running the same code. When we finally deploy our application, multiple instances of it, over the internet and the traffic comes, our users start making requests to our backend, or
they use our frontend, which in turn makes requests to our backend. All these requests from our clients get distributed among the instances. But before that, we are discussing statelessness. The only reason horizontal scaling is possible is this property called statelessness, which means that no single server, no single instance, no single machine in this chain contains any data that is exclusive to that server. Holding such exclusive data is exactly what
we mean by being stateful: the server holds some information, some data, that only it has. It remembers something about the client, about the user, and stores that information locally on the server. And if something like that happens, if, let's say, these are instances A, B, C, and D, and instance D holds some data that instance C or instance A cannot access, data exclusive to that one instance,
then our horizontal scaling will not work as expected and we will see weird errors pop up across our entire stack. This is what we meant in the initial part of the discussion when we said horizontal scaling requires planning your entire stack from the ground up. It requires changes in our code, unlike vertical scaling, where we only go to our infrastructure management dashboard and add more capacity, more CPU cores, and more memory to
our instance. Horizontal scaling, by contrast, has to be planned from the ground up: it will affect our code, it will affect our infrastructure, it will affect our entire stack. In horizontal scaling, we have to ensure statelessness, which means it does not matter which server a request goes to. Let's say this is our internet, our users communicate via their browsers, and through the internet our servers receive the requests. What we are saying is that it
does not matter which server ends up getting a particular request; the result should always be the same. Even if one of these servers gets deleted, say we get rid of B, then A combined with C and D should show the exact same behavior. That's what we mean by statelessness. No instance of our server should hold any piece of information that is not accessible by all the instances. Which means that if we need
to persist any data, it should live somewhere outside all our servers, where any instance can access it. Okay, that's pretty much what we mean by statelessness. Now, a couple of examples of where exactly this statelessness property matters. The first is sessions. In a typical authentication flow, a user comes, they enter their email and their password, our server gets that request, it runs whatever
authentication procedure it uses, and finally it has successfully authenticated the user. It creates a session for that user, saves the session ID in the user's browser as a cookie, and also saves the session somewhere on the server, so that the next time a request arrives with the session ID cookie it can look up the stored session, match it against the ID, and verify the user's identity. That's how typical stateful authentication works. Now let's
say, in our horizontal scaling scenario, we have three instances of our server: A, B, and C. While authenticating, the user communicated with instance A, and instance A created a session, stored the cookie in the user's browser, and created the session entry. But what it did was store the session information inside its own memory: it had something like an array, and it stored this user's information there along with other users' information. Instance B and instance C, since this
is an in-memory data structure, have no access to it. And in turn, what happens? Because we have a horizontal scaling setup (we have not discussed the load balancer part yet), the next time the user sends a request, the request ends up at instance B. Now B has no access to this user's session information, since the user authenticated with instance A and instance A stored the information inside its own memory. So what happens? B throws a 401 error and says you have to authenticate first, which creates
a very confusing, very frustrating user experience for our user. So any session-related information is a clear use case for the statelessness property. Instead of storing the session inside its own memory, what instance A should have done is store it in an in-memory data store, something like Redis, with this Redis instance accessible by all the app instances. Then, the next time a request comes to
instance B, B can check the Redis instance. Since instance A, after authenticating the user, saved the session information in Redis, and since that Redis instance is accessible to all instances, B can verify that the user is already authenticated and successfully process the request. That is how the statelessness property helps with session-related use cases. Next up, file storage, and it's the same story.
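The difference between in-memory sessions and an externalized session store can be sketched in a few lines of Python. This is a hedged illustration, not a real framework: the class and variable names are made up, and a plain dict stands in for the shared Redis instance; in production, each store access would be a GET/SET against an actual Redis server through a client such as redis-py.

```python
import secrets

# A sketch of externalized sessions (hypothetical names). A plain dict stands
# in for the shared Redis instance here; in production, each access below
# would be a GET/SET against a Redis server reachable by every app instance.
shared_session_store = {}

class AppInstance:
    """One horizontally scaled backend instance. It keeps no session state
    of its own -- it only reads and writes the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # the same store object is shared by all instances

    def login(self, user_id):
        session_id = secrets.token_hex(16)
        self.store[session_id] = user_id  # like: SET session:<id> <user>
        return session_id                 # sent back to the browser as a cookie

    def handle_request(self, session_id):
        user_id = self.store.get(session_id)  # like: GET session:<id>
        if user_id is None:
            return 401  # unknown session -> must authenticate
        return 200      # any instance can verify the session

# The user authenticates against instance A, but the next request lands on B:
a = AppInstance("A", shared_session_store)
b = AppInstance("B", shared_session_store)
sid = a.login("user-42")
print(b.handle_request(sid))  # 200 -- B sees the session that A created
```

Because every instance reads and writes the same external store, it no longer matters which instance the load balancer picks for the follow-up request.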
User A uploaded a file, and because of our horizontal scaling setup the first request ended up at instance A. What did instance A do? It took the file and saved it in its own storage, its SSD or whatever local storage it has. Because of that, when the next request landed on instance C, C had no access to that file and threw an error. Again, to support our horizontal scaling architecture, instead of storing the file on
instance A's local disk, it should upload it to some kind of object storage: something like S3, or Cloudflare R2, or maybe an on-premise MinIO instance, whatever, but ideally an object storage that is accessible by all the instances. This kind of pattern will keep repeating, but the intuition is very clear. Once you have decided that you are going with horizontal scaling, you have to make sure, at every point, at the code level, that no piece of information, no file, nothing is saved inside a
particular instance. Everything should be centralized. So for an in-memory store you should use something like Redis; for file uploads, S3 or any kind of blob storage. The same goes for databases: instead of using something like SQLite and saving the DB file on your own server, you should use a centralized database, Postgres, RDS, whatever. That is the rule of thumb: if you're using horizontal scaling, make sure no piece of information is stored in a single instance, and always think about the
statelessness property of your whole architecture. Okay. Now let's talk about load balancers. We have been delaying this for too long, even though we have mentioned them so many times. Load balancers are one of the most important and one of the mandatory components of a horizontal scaling architecture, and we'll see why: without a load balancer, the setup of horizontal scaling is almost impossible. In fact, it starts with thinking about a load balancer, with setting up a load balancer. The intuition is very simple if you think
about it. We have already said that in horizontal scaling we have multiple instances of our server (repeating the same thing again), and let's say these are our users who want to communicate, each through a web browser, and this is the internet, and these browsers are making requests through the internet. Now, if we have multiple instances of our server, at
what point in our whole stack do we decide, when a user makes a particular request, which instance that request is sent to? How do we decide, and more importantly, where exactly do we make that decision? That's exactly where load balancers come in. You can imagine a load balancer sitting somewhere around this point; in short form we call it an LB. So again, the solution is pretty simple. All the requests coming from all the
clients, from all our users, go to our load balancer instead of going to our servers directly. The load balancer takes all those requests, and there is some amount of logic here where it decides which instance it should forward each particular request to (we'll talk about that logic), and after it decides, it forwards the request using that
piece of logic, using one of a few algorithms. The server instance takes the request, processes it, and returns the response in whatever form we are sending, and the load balancer takes that and returns it over the same HTTP connection. That part is pretty straightforward: the load balancer is pretty much a middleman that takes requests from the internet and forwards them to our server instances, takes the
responses back and sends them back. The important part is the piece of logic where it decides exactly which instance to forward a particular request to. That's what we call the load-balancing algorithm. The idea is simple: our load balancer wants to decide which server it should send the request to. The first and simplest algorithm we have is round-robin. The way it works: let's say this is our load balancer and these are our servers, A, B, and C.
The round-robin algorithm sends the requests in a rotating order. The first request comes in from the internet and it goes to server A, the second to server B, and the third to server C. Then the fourth one comes and it goes to server A again, the fifth to B, and the sixth to C, and the cycle continues. That's why we call it round-robin: a rotating order of sending requests.
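As a tiny sketch (Python, with made-up server names), round-robin is nothing more than a repeating cycle over the server list:

```python
from itertools import cycle

# Round-robin dispatch in miniature: requests are handed to servers in a
# fixed rotating order, ignoring request cost and server load entirely.
servers = ["A", "B", "C"]
rotation = cycle(servers)

# Where do the first six requests go?
assignments = [next(rotation) for _ in range(6)]
print(assignments)  # ['A', 'B', 'C', 'A', 'B', 'C']
```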
This round-robin algorithm works best in a setup where the requests received by the load balancer are of a similar kind, in the sense that the cost of those requests, the kind of operations the server performs for them (database operations, queue operations, whatever), is roughly the same, and the servers hold the same capacity, each with, let's say, 4 GB of RAM and
two CPU cores. If we have that kind of setup, a very simple round-robin algorithm for our load balancer works well: we can get through a lot of traffic with a simple algorithm. But consider that we have two kinds of requests: request type one, and, marked in red, request type two. Request type one does a very simple database read, a select operation, some kind of select star from users, etc.,
right, let's say fetching the user profile, which, depending on which region your database is in and where your users are, on average takes around 200 to 300 milliseconds. But then we have request type two, which we consider an expensive request. What do we mean by expensive? Let's say that after getting the request, the server has to make an HTTP call to an external server, something like Elasticsearch, or
it is sending an email via some external service, a server which is not our own. After getting that response, it has to perform a database write, and the write lands on a very big table with a lot of rows and a lot of indices. As we already know, indices help speed up select operations and join operations, but every write operation, any modification like an update or inserting a new row, has to update all
the indices. So let's say we have a couple of indices here, and every write operation is expensive because of that. All in all, this whole API call, with the external service call, the database write, and everything, takes around 2 seconds to respond. Now, since our load balancer is using a simple round-robin algorithm, it sends the requests in a pretty mindless way. Let's say it got the requests in this order: the first request was of
type one, the second of type two, the third again type two, and the fourth again type one. Since the load balancer has no idea what kind of request it is dealing with, and no intelligent routing logic, it can end up sending all the expensive requests to a single server without realizing it, and there is a chance that this particular server will eventually crash if a skew like this happens in our request traffic. So there
is a different variation of the round-robin algorithm. Let's say this is again our load balancer, with our servers A, B, and C, and the load balancer is sending requests in a round-robin fashion. The variation is called weighted round-robin. Let's say our server A has 8 GB of RAM and a four-core CPU, while server B has 4 GB of RAM and a two-core CPU, and C also has 4 GB of RAM and a two-core CPU. With a setup like this, we can configure our
load balancer in a weighted round-robin fashion so that it sends twice as many requests to server A as to the other servers. So if it gets a burst of eight requests, it will first send two requests to server A, then one to server B, then one to server C, then again two to server A, and so on. Server A will always receive twice as many requests, since server A has more capacity to process them.
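That weighted rotation can be sketched like this (illustrative only; the weights and names are made up, and real load balancers such as nginx or HAProxy interleave weighted picks more smoothly than this naive cycle expansion):

```python
# Weighted round-robin sketch: server A (8 GB / 4 cores) gets weight 2,
# B and C (4 GB / 2 cores) get weight 1, so A sees twice as many requests.
weights = {"A": 2, "B": 1, "C": 1}

# One full rotation expanded by weight: A, A, B, C
one_cycle = [server for server, w in weights.items() for _ in range(w)]

def assign(n_requests):
    """Return which server each of the next n_requests goes to."""
    return [one_cycle[i % len(one_cycle)] for i in range(n_requests)]

print(assign(8))  # ['A', 'A', 'B', 'C', 'A', 'A', 'B', 'C']
```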
But even with this kind of setup, the load balancer still cannot intelligently decide which kind of request should be sent to which server. So we have a different algorithm, one we can consider a little smarter than round-robin, called least connections. The way it works: again we have a load balancer and three
server instances A, B, and C, all with the same capacity, 4 GB of RAM and two CPU cores, and all the requests arrive at the load balancer. But instead of sending requests mindlessly in a rotating order, the load balancer checks which server currently holds the least number of active connections, and it sends the next request to that server. And as you already know, the HTTP protocol, the one we typically use between frontend and backend, works on a request-and-
response cycle. The client sends a request and the server sends a response back with a status code, something like 200 or 400. The browser sends the request and waits for the response, and while it waits we call that an active connection: the HTTP connection, and behind the scenes the TCP connection, stays active while the server is processing, while the server is making that external API call, doing that database write, or
performing whatever heavy operations it is doing at that point. The connection keeps waiting for the server's response, and after the server is done, it sends the response back and the connection is closed. (Depending on whether you are using HTTP/2 or HTTP/3, the connection closing and multiplexing logic will differ, but speaking very simply, in HTTP/1.1 terms, the server sends the response back and the connection is closed.) So the key observation for the least connections algorithm is that for expensive operations the
connection will stay active for a longer amount of time. So let's say we again have two types of requests: type one, which is very lightweight, and type two, which is heavy. Imagine the first request arrives at the load balancer. When all the servers have just started, none of them has any active connections, so the load balancer can choose any server at random. For the first request
it chooses instance A, and the request happens to be of type two, the one that takes 2 seconds. Then comes the next request, of type one. The load balancer sees that instances B and C have zero active connections, so this second request goes to instance B. A third request, again of type
one, goes to server C. Then comes the fourth request. By the time it arrives, server A is still serving the type-two request, since that is a heavy API call and its connection is still active; but since type one is lightweight and resolves in around 200 milliseconds, either server B or server C is already done with its
initial request and its connection count went back to zero. So the fourth request goes to either server B or server C, and since server A is still dealing with a heavy call, it will not be sent more requests to process. By tweaking our algorithm to consider active connections, we make the load balancer decide intelligently which server should get the next request, which performs a little better than plain round-robin, which just sends requests in a rotating order.
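The walkthrough above can be simulated in a few lines (a toy sketch with hypothetical request costs; a real load balancer updates these counters as connections actually open and close):

```python
# Least-connections simulation of the walkthrough above. The request "costs"
# are hypothetical; a real LB updates these counters as connections open/close.
active = {"A": 0, "B": 0, "C": 0}  # active connections per server

def pick():
    # min() over the dict breaks ties by insertion order (A before B before C)
    server = min(active, key=active.get)
    active[server] += 1  # the forwarded request opens a connection
    return server

def finish(server):
    active[server] -= 1  # the response was sent; the connection closes

first = pick()    # heavy 2 s request -> "A" (every count is still 0)
second = pick()   # light request     -> "B"
third = pick()    # light request     -> "C"
finish(second)    # B's light request resolves after ~200 ms
fourth = pick()   # A is still busy with the heavy call, so -> "B"
print(first, second, third, fourth)  # A B C B
```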
There is also a weighted variation on top of the least connections algorithm, which means we use the same algorithm but the server capacities differ. Instead of all instances having the same capacity, let's say server A has 8 GB of RAM and a four-core CPU while B and C have 4 GB of RAM and two-core CPUs; the algorithm stays the same, but instance A will end up receiving roughly twice as many requests. And similarly we have other load-balancer
algorithms. Algorithms like least response time, where the load balancer checks which server instances are returning responses fastest and tries to send more requests to those servers; in turn, the servers sending slower responses, the ones already struggling for resources, get fewer requests. In the same way, we also have resource-based algorithms, which check which server
has high CPU or memory (RAM) usage at the moment; the servers with lower resource usage receive more requests. Similarly, there are other algorithms, but the idea is always the same: depending on some piece of logic, the load balancer decides which server instance should get the next request.
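Both of these policies boil down to picking the minimum of some live metric. A minimal sketch with made-up numbers (in a real load balancer these would be moving averages of observed response times and health-reported CPU figures):

```python
# Hypothetical live metrics per server -- in a real load balancer these would
# be moving averages of observed response times and reported CPU usage.
avg_response_ms = {"A": 120.0, "B": 480.0, "C": 95.0}
cpu_usage_pct = {"A": 35, "B": 90, "C": 55}

# Least response time: route the next request to the currently fastest server.
fastest = min(avg_response_ms, key=avg_response_ms.get)

# Resource-based: route the next request to the least-loaded server.
least_loaded = min(cpu_usage_pct, key=cpu_usage_pct.get)

print(fastest, least_loaded)  # C A
```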
The next interesting question you might ask: say we have a load balancer and three server instances A, B, and C, and the load balancer was following the round-robin algorithm or the least connections algorithm, whichever you like, and for some reason our server A crashed. Maybe it received too many requests because of our simple round-robin algorithm, or maybe some infrastructure misconfiguration happened while using least connections; either way, server A is not able to respond to any more requests, because the machine has completely stopped. It is not working. Now, whatever algorithm the load balancer is using, what happens? If it is using
round-robin, it will keep sending requests in rotating order: first A, then B, then C, then A, B, C again. Every request that goes to server instance A will obviously fail; those clients will get something like a 502 or 503, and those users will keep getting these errors, while the other users will see no errors because their
requests are going to instance B or instance C. But again, since this is round-robin, the next request of a user whose previous request was served by instance B might go to server A next time, and they might face the same errors too, which can quickly become a very frustrating user experience. So how do we actually solve this? How do load balancers handle a particular server instance going dead, a particular server instance no longer being able to serve any requests? So
that's where we have the concept called health checks, a very simple but effective technique. We have our load balancer here, servers here, and these are the typical requests coming from our users, which the load balancer forwards to the servers, returning the responses back to our users. While that is happening, the load balancer keeps sending its own test request, a very simple GET endpoint, to all
the servers, let's say every second. Alongside the actual requests, the ones coming from our users, it sends a test request every second to all the server instances, and it keeps doing that. In the response, it expects a success status, meaning something in the 2xx range, like 200 or 201, although the common practice is to return a 200. So if the server is online, if the server is currently processing requests, it will obviously get that test request and it will send
a 200 response back immediately, because usually it's just a test request; no heavy operation happens there, so sending this test request every second does not really put any load on our instances. But the moment, let's say, instance A goes down, then the next second, when the load balancer sends its test request to server A, it does not get a 200 back; by default the load balancer's HTTP client sees something like a 502
response, and the moment it gets an error response, any response outside the 2xx series, it immediately puts this server on a blacklist, some kind of exclusion list. That means we will no longer send any user requests (the requests marked in blue here, the ones coming from our actual users) to this particular instance. So all the user requests
they will go to either server B or server C, depending on whatever algorithm you are using, round-robin, least connections, whatever. But meanwhile, the load balancer will keep sending the test request to the blacklisted instance; otherwise, how would it know when that server instance has come back online and is able to serve responses again? So these test requests keep going, and once the load balancer gets a successful response, something like a 200, it removes that instance from the blacklist, and from the
next request onwards it will also send the blue requests, the ones coming from our users, to server A again. Using this simple technique called health checks, our load balancer can decide whether a server is healthy, whether it is able to cater to our users' requests.
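Here is a compact sketch of that blacklist logic (hypothetical names; `probe` stands in for an HTTP GET to a health endpoint such as `/health`, and the one-second timer loop is left out):

```python
# Health-check sketch: the LB probes every server each interval; a non-2xx
# probe blacklists the server, and a later 200 brings it back. `probe` is a
# stand-in for an HTTP GET to a health endpoint; the 1-second loop is omitted.
class LoadBalancer:
    def __init__(self, servers):
        self.servers = set(servers)
        self.blacklist = set()

    def health_check(self, probe):
        # Probe EVERY server, including blacklisted ones -- that is the only
        # way to notice that a dead server has come back online.
        for server in self.servers:
            status = probe(server)
            if 200 <= status < 300:
                self.blacklist.discard(server)
            else:
                self.blacklist.add(server)

    def healthy(self):
        # Only these servers receive real user traffic.
        return sorted(self.servers - self.blacklist)

lb = LoadBalancer(["A", "B", "C"])
lb.health_check(lambda s: 502 if s == "A" else 200)  # A went down
print(lb.healthy())  # ['B', 'C']
lb.health_check(lambda s: 200)                       # A recovered
print(lb.healthy())  # ['A', 'B', 'C']
```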
So that's pretty much all about load balancers and horizontal scaling that I wanted to talk about. Of course, it's not comprehensive in any way, but it's good to know which components play an important role while you are thinking about scaling your backends, and load balancers are a very big part of that. It's a brief introduction, but enough for you to continue your own reading on load balancers. Okay, next let's talk about database scaling. Now, scaling our application, our backend, whatever code we write, becomes straightforward once you externalize state. That's what we learned, right? The clear winner
was obviously horizontal scaling, not vertical scaling, and the key to horizontal scaling was externalizing your state. Instead of limiting any kind of information, any kind of state, to a particular instance of the server, you externalize it. You do not keep any cache limited to a particular server, you do not keep any files limited to a server instance, you do not keep any session data, or any other information you can think of, tied to one instance. You externalize
everything: you use Redis or a database for all the session-related information, and an object storage like S3 for your files, instead of keeping them on your own local instance. Right? All the stuff we discussed. Once you externalize your state, you add more servers behind your load balancer, and your capacity increases linearly with the number of servers you can add, that you can afford. But while you're doing that, the part that becomes a little difficult, a little tricky to scale, is your database, which is a stateful part
of your architecture. Your database cannot be duplicated the way you just duplicated your servers. Each database holds some amount of data, a lot of data, in its files, in its local storage, and that data must be consistent. Which means that if you have multiple database instances, when a request arrives at a particular database instance, the data it sends back should be exactly the same as what any other instance
of the database would send back. Making this coordination work is what makes scaling databases a very tricky part of scaling your overall backend architecture. Of course, we already have a couple of tried and tested solutions in our industry for scaling the database component, the stateful component of our backend architecture. The first and, in most use cases, the obvious one is called read replicas. Read replicas are a very simple and
intuitive architecture for scaling databases. So here is our single database instance. Since we went with horizontal scaling, the number of servers keeps growing, because we want to cater to millions and billions of users' worth of traffic, but all these servers are still dealing with the same database instance. We were able to scale our application code, our processing logic, but our storage layer, the database component we interact with to store and retrieve data, is
still a single instance. And this database is, again, just a server; let's say it has 4 GB of RAM and four CPU cores. Eventually it will run out of resources trying to serve all the requests that your backend is demanding of it. So we also have to figure out some way of scaling our database. Read replicas are one way to do that, to scale your database operations across
multiple instances. And since databases are a stateful component in our architecture, we cannot just add more instances, right? We would have all these data consistency problems. So one popular architecture we have is the read replica, and this is how it works. We have one database instance which we call the primary, or parent, or master (there are multiple terminologies here; you can call it the primary or master instance, whatever). This one is the primary database instance, and we have
three replicas, which are the secondary instances, or child instances, depending on the terminology. The idea is that all these instances of our database hold the data, but the replicas only cater to read requests, meaning selects: you can only read data from these instances; you cannot write data back to them. That is rule number one. And typically we create these replicas, or instances,
of our database, which hold all the data that the primary holds, and we spread them across the world. So if you have users in India, the US, and China, you can have the primary database in the US and three read replicas, one in India, one in China, one in Japan, and so on. Users sending requests from those countries (assuming, of course, that you also have your
backends spread across those countries) will be served by those backend instances combined with those database instances, and in turn the requests and responses will be faster compared to all requests going to a single instance in the US. So there are two benefits of read replicas. One is lower load on your primary database instance, and the second one is latency: your backend will experience less
amount of latency. So read replicas are obviously a very good solution for all your read queries. If we look at most applications, at least the SaaS applications where this backend context primarily applies, 90%, or, being conservative, 70% of the requests we receive are read requests; the backend mostly issues selects to our database. All those 70% of
requests will be served by the read replicas, and the remaining 30%, things like inserts, updates, or deletes, any write-operation-based requests, will go to our primary instance. So where our database initially had to deal with 100% of the requests, now it is down to only 30%, and the resource utilization, the load on the primary, will be much lower.
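A minimal sketch of that read/write split, with placeholder connection names (real setups do this in the driver, an ORM, or a proxy like ProxySQL or pgpool, and use more robust SQL parsing than peeking at the first keyword):

```python
import itertools

# Read/write splitting sketch: SELECTs rotate across read replicas, while
# writes (INSERT/UPDATE/DELETE/...) always go to the primary. The connection
# names are placeholders for real database connections.
class ReplicaRouter:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = itertools.cycle(replicas)

    def route(self, sql):
        verb = sql.lstrip().split()[0].upper()
        if verb == "SELECT":
            return next(self.replicas)  # reads: ~70% of typical SaaS traffic
        return self.primary             # every write hits the primary

router = ReplicaRouter("primary-us", ["replica-in", "replica-cn", "replica-jp"])
print(router.route("SELECT * FROM users"))          # replica-in
print(router.route("UPDATE users SET name = 'B'"))  # primary-us
print(router.route("SELECT * FROM orders"))         # replica-cn
```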
And in this way we were able to scale our databases. Now, of course, it does not come without its trade-offs. The problem is consistency. What do we mean by consistency? Let's take an example. We have a user here, and from their browser they just updated their name on their user profile page. Since it is a write operation, that
particular request went, via the backend (let's say there is a backend instance here), to our primary database instance, the one that handles any kind of write operation, right? And the moment they updated their name, they refreshed the page. Refreshing the page means fetching the user profile data back from our database and sending it to the client, to the browser, so
this request is a read request, not a write request, and, arriving only a few milliseconds later, say 200 milliseconds, it obviously went to a read replica. Now let's assume this interaction happened from India, while the primary database instance is in the US. The write operation happened on our primary database instance, so that particular instance currently holds the updated data, the updated name of the
user. So let's say the user changed the name from A to A. Now at this point of time at this particular 10 milliseconds range only this primary database holds the updated data. Of course if we have this read replica based architecture there will obviously be some kind of replication algorithm right we cannot go into the depths of the replication algorithm and all but most of the modern databases posgress MySQL SQLite on all
these databases they have native features to support replication right so you can imagine there is some algorithm some kind of setup happening here using which the data from our primary database flows back to the read replicas, right? It makes sense intuitively if you have this kind of architecture. If these replicas are a copy of our primary databases, then there has to be some kind of mechanism when any kind of updated data flows back to all these read replicas. But even if that
mechanism is in place, we still have to consider physics. The distance between these two servers, these two physical servers from US to India, there is some amount of latency there. let's say 200 milliseconds or 300 milliseconds because at the end of the day our data travels through optic fiber cables from under the sea or any kind of medium so that one server can send some data to another
server and the physical distance between two servers have the largest role to play when it comes to latency. We cannot beat that. We cannot beat physics. So the distance between US and India assuming the fastest medium assuming the latency is 200 millconds. So the replication lag what we call the technological term for this is called replication lag which is 200 milliseconds. But this particular interaction usually in our client apps what happens usually in our form fields
especially in our form fields we make an update to that particular field and we click on the save button and the moment we click on the save button the request goes to our server. Our server does the write operation and it sends something like a successful response either 20 0 or 201 if it created a resource. Now the moment it receives a successful response if we are talking about SP single page
applications or server rendered pages here the client framework it issues a get request which means we want to fetch the latest form field data for this user profile. So the moment we get a successful response in that millisecond only we fire another get request and since this is a read request it goes to the read replica and since it has not been 200 milliseconds so the replication is not complete. This particular read replica does not have the updated user
data the updated user's name which is AB. So even though our client said that your username has been updated, your name has been updated from A to A but in the next moment the read replica will send an outdated response which is A. So that particular user interaction will become a very confusing or if it is a more sensitive operation something like payment related operation or invoice related operation it can also cause a
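The stale-read window described above can be modeled with a toy in-memory primary/replica pair, where the replica only applies writes once a fixed replication lag has passed. All the names and the tick-based clock here are illustrative, not a real database API:

```python
class ToyPrimary:
    """Toy primary: applies writes immediately and appends them to a replication log."""
    def __init__(self):
        self.data = {}
        self.log = []  # replication log: (tick, key, value)

    def write(self, tick, key, value):
        self.data[key] = value
        self.log.append((tick, key, value))


class ToyReplica:
    """Toy replica: only sees a write after `lag` ticks (our 200 ms replication lag)."""
    def __init__(self, primary, lag):
        self.primary, self.lag = primary, lag
        self.data = {}

    def read(self, tick, key):
        # apply every log entry that is at least `lag` ticks old
        for (t, k, v) in self.primary.log:
            if tick - t >= self.lag:
                self.data[k] = v
        return self.data.get(key)


primary = ToyPrimary()
replica = ToyReplica(primary, lag=200)          # 200 ms replication lag
primary.write(tick=0, key="user:5:name", value="AB")
print(replica.read(tick=10, key="user:5:name"))   # None -> the stale read
print(replica.read(tick=250, key="user:5:name"))  # "AB" -> replica caught up
```

The read at tick 10 is exactly the refresh-right-after-save scenario: the client already saw a success response, but the replica has not applied the write yet.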
So this problem of consistency is the biggest challenge when we are talking about a read-replica-based architecture for database scaling. Over time there have been many proposed and implemented solutions to deal with it. A couple of them: first, when a write operation happens, it goes to our primary database instance, and say the write happened on the user profile entity, the user profile model, the users table, whatever you call it. One solution is that, since the users table was just updated in this interaction, whatever read query comes right after that write query, we route it to our primary instance instead of the read replicas. That is one solution you can implement: routing requests intelligently. Another thing you can do is track the replication lag we talked about, the 200 milliseconds or so it takes to copy updated data from the primary to the replicas. You keep measuring this lag, averaging something like 200 to 250 milliseconds, and while replication is in progress you block your read queries, or make them wait, until replication is complete, and only then send the updated data back in the response. That is another solution. Apart from that, you can also architect your frontend so that the updated data is not refetched that very instant; you add a small planned delay, say 300 milliseconds, before firing the next GET request. So there are all these different kinds of solutions, and which one you implement depends on what kind of trade-off you are willing to make.
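As a sketch, the first solution, routing reads that closely follow a write to the primary, could look something like this. It is a toy router assuming a fixed replication-lag window; the class and parameter names are made up for illustration:

```python
import time


class QueryRouter:
    """Toy read-your-writes router: after a write touches a key (e.g. a user id),
    reads for that key go to the primary until the assumed replication-lag
    window has passed; all other reads go to a replica."""

    def __init__(self, replication_lag_s=0.3):
        self.replication_lag_s = replication_lag_s
        self.last_write_at = {}  # key -> timestamp of the last write

    def route_write(self, key):
        self.last_write_at[key] = time.monotonic()
        return "primary"  # writes always go to the primary

    def route_read(self, key):
        wrote_at = self.last_write_at.get(key)
        if wrote_at is not None and time.monotonic() - wrote_at < self.replication_lag_s:
            return "primary"  # the replica may still be stale for this key
        return "replica"


router = QueryRouter(replication_lag_s=0.3)
router.route_write("user:5")
print(router.route_read("user:5"))  # prints "primary": read right after a write
```

A real implementation would live in the database access layer or a proxy and would typically key on table or tenant rather than a single record, but the trade-off is the same: a short window of extra load on the primary in exchange for read-your-writes consistency.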
Every time you are talking about distributed systems, and the read-replica-based architecture we discussed here is a distributed systems concept, you always have to make some trade-off, and most of the time it is about consistency: how do you make sure that all of your instances have the up-to-date information about your data and your users? But anyway, all the modern managed database providers, like AWS RDS or Google Cloud SQL, make it very easy to set up read replicas, and to set them up in particular regions. It just takes a few configuration flags or a couple of toggles in the UI and your read replicas will be ready. They also have a lot of tried-and-tested, robust solutions to deal with all these consistency problems for read replicas. You as a backend engineer will not have to worry about these deep infra-level problems, but it is still nice to know that these kinds of trade-offs always exist in your infrastructure setup, especially when you are talking about database scaling.

Now moving on, another famous technique that will keep coming up whenever we talk about database scaling is sharding, also known as partitioning. Sharding basically means this: let's say you have an orders table, you run an e-commerce based
application, something like Amazon or Flipkart, and you have a database table where you store order information. For an e-commerce application you get thousands, even millions of orders every day, so your orders table eventually grows to billions of rows. And whenever you want to fetch some data from this orders table, you run something like SELECT * FROM orders WHERE user_id = 5, right? That is the query you run when a user goes to their orders page and you want to fetch all of that user's orders. But running a query like that on a table with billions of rows can take a long time, and even if you have an index on the user_id column, it will still be slow at that volume if you are a very large e-commerce company. Query latency is the first problem. The second is, of course, scale: we want our database to handle a large number of requests in a short time, which means more resources, which means more instances of our database. Scaling out instances is the second problem. Sharding solves both of these problems.

And how does it solve them? Take the orders table. What we do is shard this table, partition it, which means we divide it. Say this is the whole table, and instead of billions of rows, just to understand the concept, imagine it has 10 rows, where one row stands for a billion rows, so really 10 billion rows, but we will talk in terms of 1 to 10. What we did is divide it into two parts: rows 1 to 5 go into one database instance and rows 6 to 10 into a second database instance. So we took the orders table and divided it such that the two halves physically live in two separate database instances.

And how did we divide it? Every time you think about sharding, one of the trickiest problems is figuring out the sharding key, meaning the parameter according to which you divide the table. How do you decide that this data should live in this instance and that data should live in that instance? You have to come up with some criterion: "we will divide according to X". In this example, the first physical database instance holds only the orders from January to June, and the second holds the orders from July to December. The order date has become your sharding key.

Now, what is the benefit? You have two physical database instances, which has the clear benefit that two instances can serve more requests per second. If this is our internet and these are our server instances, then since we have two databases now, they can handle more requests. That solves the second problem. As for the first problem, query latency: instead of dealing with 10 billion rows, each database instance now only has to deal with 5 billion rows (five rows in our toy picture), so query latency also drops because the number of rows per instance drops. Of course, this was a very simple example to show the concept; depending on your volume you can also shard by month: one shard for January, another shard for February, another for March, and so on. Then somewhere in your backend, in your database routing layer, before making the query you decide which particular shard holds this data, you send the query to that shard, it sends the response back, and you return it to the user.
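A minimal sketch of that routing-layer decision, using the order date as the sharding key with one shard per half-year as in the example above. The shard names and connection strings are hypothetical:

```python
import datetime

# Hypothetical shard map for the orders table: the order date is the
# sharding key, one physical instance per half-year.
SHARDS = {
    "shard_jan_jun": "postgres://db-1.internal/orders",  # months 1-6
    "shard_jul_dec": "postgres://db-2.internal/orders",  # months 7-12
}


def shard_for(order_date: datetime.date) -> str:
    """Routing-layer decision: which physical instance holds this order."""
    return "shard_jan_jun" if order_date.month <= 6 else "shard_jul_dec"


print(shard_for(datetime.date(2025, 3, 14)))   # prints "shard_jan_jun"
print(shard_for(datetime.date(2025, 11, 2)))   # prints "shard_jul_dec"
```

The backend would look up `SHARDS[shard_for(date)]` to pick a connection before running the query; a finer-grained scheme (one shard per month, or a hash of the user ID) only changes this one function.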
That is pretty much what sharding is: dividing your data across multiple physical database instances; instead of keeping all the data in one table, you split the table physically. So those are two important concepts you will keep hearing about: replication and sharding, read replicas and database shards. Of course there are a lot of technical complexities behind the scenes, how to interact with them, how to create the shards, but right now we are just focusing on the concepts, the things you should know about. To actually learn how to implement them and understand them in a deep technical way, you can do your own reading.

Now, moving on to current times, and by current I mean right now, December 2025: the current trend when it comes to scaling databases is distributed databases. Some famous examples are PlanetScale, which uses Vitess behind the scenes and is primarily a MySQL-based offering. We also have Neon on the Postgres side; Neon is written in Rust and is considered a serverless database. Similarly we have CockroachDB, and we have YugabyteDB, right? A lot of distributed databases that handle all of these problems, the problem of sharding, the problem of replication, the problem of distributed transactions, all the complexities that
arise when we talk about database scaling, they handle automatically. So, realistically speaking, as a backend engineer you just go with whatever database provider you want to use, whether it is RDS, Neon, PlanetScale, whatever. It is never recommended, especially when you are starting out and especially if you don't have deep expertise in databases, database administration or database scaling, to roll out your own database infrastructure, because as we have already discussed, there is replication, there is sharding, and database backups are yet another problem: how do you keep your users' data safe, and in multiple places? There are a lot of complexities when it comes to databases, so it is not recommended that you run your own database infrastructure, even though knowing all these concepts is good and advisable, unless you know exactly what you are doing.

So realistically you will just choose a database provider depending on their features, their pricing, your team's preference, AWS, Neon, PlanetScale, whatever. You sign up, create a database, and get a URL for it, a username, a password, whatever credentials the database has. You plug that into your backend and start interacting with it. You do not have to think about replication, you do not have to think about sharding or any of that. But you do have to know these concepts so that you can go into your database provider's console and configure things: how frequently you want database backups taken, in what regions of the world you want read replicas, and how you want to plan your sharding. The actual work of replication, of sharding, of distributing your replicas, you don't have to do. But you still have to understand what all these terms mean and what all these techniques do, so that you can express how you want your database configured. That is why it was important that we talked about database scaling, even though in a realistic scenario you will not actually have to do any of this yourself, as long as you understand what the terms mean.

So far we have discussed caching within your data centers. But there is another layer of caching that we have not discussed yet, and that layer operates on the global
level. We think about geography when we think about this layer of caching, and we call it CDNs, content delivery networks. I actually have a separate video about CDNs in depth, in a different context, where I talk about how serverless platforms like Vercel work. If you want to understand CDNs in a technically deeper way, please watch that video. We will also talk about CDNs here, but not in that much depth, since that video is entirely about CDNs.

Okay. To understand why CDNs matter, why we need them in the first place, you have to understand one fundamental constraint of our world, of physics per se, which is the speed of light. Light travels at around 200,000 kilometers per second through fiber optic cables. We are not talking about the speed of light in a vacuum but in fiber optic cables, which are, at the end of the day, the backbone of our internet. We cannot cross this speed, because whatever data is transmitted from one country to another, from one part of the world to another, ultimately goes through undersea cables, and that happens at this speed. This is our cap: 200,000 km/s.

So consider a request going from Tokyo to a server in North Virginia, the us-east-1 region, where most famous commercial applications have instances. For this request to make a round trip, meaning sending the request from the browser and getting the response back from the server, it has to cover roughly 20,000 km, and at the speed of light in optical fiber that is roughly 100 milliseconds. This is our cap, our physics cap, which means no amount of optimization, no amount of technology, at least the kind of technology we have access to today, can beat it. The minimum latency that the user is going to face for a request-response round trip is 100 milliseconds for a user based in Tokyo.
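The arithmetic behind that floor is simple enough to check:

```python
FIBER_SPEED_KM_S = 200_000   # light in fiber, roughly two-thirds of c in vacuum
ROUND_TRIP_KM = 20_000       # Tokyo <-> Northern Virginia, there and back

rtt_ms = ROUND_TRIP_KM / FIBER_SPEED_KM_S * 1000
print(rtt_ms)  # 100.0 -> the ~100 ms physics floor, before any processing
```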
And 100 milliseconds honestly does not sound like much, but that is only the round trip of sending the request and getting the response back. There is still work happening after the server gets the request. The first thing is the routing layer, and the routing layer is pretty fast if you think about it: mostly it is regular-expression-based routing that decides which resource the user has requested, and the respective handler picks it up. But once the handler, whichever function is supposed to deal with this particular request, picks it up, the first step is deserialization: taking the HTTP message and deserializing it into a data structure of that programming language. In something like Node.js, the JSON becomes a JavaScript object; in a Go-based backend it becomes a struct, and so on. This JSON deserialization takes some time depending on how large the payload is, and that is still not the largest cause of latency; that is just the entry point. After that we have our service layer. We run some amount of business logic, and that typically involves talking to our database. Database latency again depends on the distance between your server and your database; assuming the best case, the server and the database are in the same region, in the same data center or a similar setup, a mid-level to complex query might still take around 50 to 100 milliseconds. On top of that, if there are any external API calls for that particular endpoint, they might add another 200 milliseconds. Combine all of that with the 100 milliseconds minimum that a user from Tokyo was already facing, and it quickly becomes somewhere close to 500 to 800 milliseconds. Even if you set aside everything we already discussed, the optimizations we can do at the database level, the caching we can do with in-memory caches like Redis and so on, the last cause of latency still boils down to the physical distance between your user and your server, the geographical gap. And this is exactly the problem CDNs are supposed to solve.

So what do we do? Instead of the request that originated in Tokyo travelling all the way to the US server, we place CDN edge locations, also called PoPs, points of presence, in the same region, say in Tokyo. So instead of a 20,000 km round trip, it comes down to something like 100 to 200 km, and the latency comes down from 100 milliseconds to something like 2 to 3 milliseconds. That is a huge performance win. In human-perceived time, 100 milliseconds versus 2 to 3 milliseconds may not feel like much, but since we are talking about scaling and performance, going from 100 to 2 is a huge difference. So that is one advantage of CDNs: we take a particular piece of content, cache it on the CDN, and place the node which is going to serve the content, which is going to send the
content back to the user, close to the users. So one obvious win of using CDNs is latency. The second is load. Since we place these CDN nodes, these points of presence, close to users, one node in Tokyo, one in, say, Mumbai, another in Singapore, whichever CDN node is nearest to where the user's request comes from serves the content. Of course there is another layer of logic where we decide which particular pieces of content to cache in the CDN; that is a separate discussion you can find in the CDN video. But assuming we have already placed a piece of content in the CDN, the user simply requests that content and the CDN sends it back. Because of that, our primary server, which is in the US, say, does not get as much traffic: the traffic gets distributed across all the CDN nodes, and the origin sees, roughly speaking, 50% less of it. So that is another win of using a CDN: your primary server does not burn as many resources, and you don't have to reach for horizontal scaling as frequently if you are using CDNs and prioritizing caching your content across CDN nodes.

So the obvious next question is: what kind of content do we usually put in our CDNs? One obvious starting point is static content. What do we mean by static content? Things like our JavaScript bundles, CSS bundles, HTML files, images, videos, fonts, right? These kinds of content don't change often. Usually when we deploy our frontend application, it becomes a bundle of JavaScript, CSS and HTML, assuming it is an SPA, a single page application, something like React. We take that bundle and cache it across CDN nodes. So when a user requests our frontend application, instead of the request going to the primary server, which would receive it, forward it to something like an S3 bucket and serve the file back, the request goes to the CDN, and since the CDN is closer to the user, the response is faster. That is the idea: since static content does not change often, caching it in a CDN makes a lot of sense, and this is typically how we deploy our SPAs, single page applications, and static sites like blogs, along with images, fonts and so on. These are all called static content and are very good candidates for CDN caching. But apart from that, we can also
cache a few API responses. For example, if you have an e-commerce site with a product catalog and you know the catalog is not going to change soon, or not frequently, you can cache that response. Of course, when using a CDN there are also techniques for invalidating the cache, or in CDN terminology, purging the cache. Take Cloudflare, one of the largest CDN providers: they have mechanisms for attaching tags to your CDN content, and based on some condition, say some data changed, or a user uploaded a new blog post, you can take a tag, for example a tag carrying the user's ID that covers all of that user's blog posts, and say: purge all the cached content under this tag. That means deleting all the cached blog posts of this user, fetching fresh data from the primary server, and caching it in the CDN again.
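As a sketch, purging by cache tag through Cloudflare's API looks roughly like this. The endpoint shape follows Cloudflare's public purge_cache API (purge-by-tag is an Enterprise-plan feature); the zone ID, token and tag here are placeholders:

```python
import json


def build_purge_by_tag_request(zone_id: str, api_token: str, tags: list) -> dict:
    """Builds the HTTP request for Cloudflare's purge-by-cache-tag endpoint.
    zone_id and api_token are placeholders for your real credentials."""
    return {
        "method": "POST",
        "url": f"https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache",
        "headers": {
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"tags": tags}),
    }


# e.g. after user 42 publishes a new blog post, drop all their cached posts:
req = build_purge_by_tag_request("my-zone-id", "my-api-token", ["user-42-blogs"])
print(req["body"])  # prints {"tags": ["user-42-blogs"]}
```

You would hand this request to any HTTP client; the important part is the idea that one tag fans out to every cached object it was attached to.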
So depending on conditions like these, you decide when to remove stuff from your CDN cache and when to load new data. With those conditions in place you don't have to worry about serving stale content; that is simply part of your strategy for managing CDN content. So API responses are also cached, depending on the use case.

A third use case of CDNs, especially the one Cloudflare pushes as a USP when selling their CDN, is security. One of the famous modern-day attacks is the DDoS attack, where an attacker controls a lot of computers across the internet, a lot of bots, using some kind of malware. The attacker uses all these bots, spread across the whole world, say around 20,000 of them. If the attacker wants to target your server, they write a script, and computers from all over the world start flooding your server with traffic. Since our servers have limits on their resources, eventually those limits get crossed, and either the server crashes, if there are hard limits, or, if we have configured horizontal scaling and all the advanced scaling techniques, our infrastructure keeps spinning up new instances to absorb all that traffic and in turn incurs heavy charges. You might be looking at something like $50,000 for a single day because all those new instances were spun up to serve a huge amount of junk traffic. So you face either damage in the sense that your server crashes, or financial damage. DDoS is still very relevant; every other day you read in the news that some service was hit with a DDoS attack.

So one of Cloudflare's offerings is this: you place the Cloudflare CDN in front of your servers, so all traffic goes through Cloudflare first and only then reaches your origin. If there is cached content, Cloudflare returns it from its own edge; otherwise it forwards the request to your server. And this Cloudflare layer also acts as a security layer: if it sees a huge amount of traffic in a very short time span, they have very advanced methodologies to detect these kinds of attacks, because they have been working on this for years. The moment they detect an attack, they trigger a series of steps, things like asking suspicious users to solve captchas.
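Real detection pipelines are far more sophisticated, but the core idea, reacting when a client's request rate spikes past a threshold, can be shown with a toy sliding window. The limit and window values here are invented for illustration:

```python
from collections import deque


class RateWindow:
    """Toy sliding-window detector: flag a client once it sends more than
    `limit` requests within `window` seconds. A real edge layer would then
    challenge the client (e.g. with a captcha) instead of just returning False."""

    def __init__(self, limit=100, window=1.0):
        self.limit, self.window = limit, window
        self.hits = deque()  # timestamps of recent requests from this client

    def allow(self, now: float) -> bool:
        self.hits.append(now)
        # drop timestamps that have fallen out of the window
        while self.hits and now - self.hits[0] > self.window:
            self.hits.popleft()
        return len(self.hits) <= self.limit


w = RateWindow(limit=3, window=1.0)
decisions = [w.allow(t) for t in (0.0, 0.1, 0.2, 0.3)]
print(decisions)  # [True, True, True, False] -> the 4th request gets challenged
```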
And on top of that, all this attack traffic cannot really damage your servers, because there is an additional layer in front of them, and since Cloudflare runs one of the largest CDN networks in the world, the traffic gets distributed across their nodes. No serious damage can be done even when the traffic is in terabytes or petabytes, because Cloudflare's network is that large. So that is the third advantage of using a CDN.

Another interesting term that keeps coming up when we talk about CDNs is edge computing. Traditionally, the word "edge" meant this: we have our users in Tokyo, India, Singapore, and we have CDN nodes for all these regions, and these nodes are usually called edge nodes because they sit at the edge of the network. We reach the internet through our internet service providers, ISPs, and CDNs are usually strategically placed in collaboration with those ISPs, so that a user's request hits the CDN node at the edge of the network, at the first point of contact. That is why CDN responses are so fast: the nodes are placed at the edge, the first point of the network. That was the original use of the word in the context of CDNs: edge nodes. Now, in modern times, a similar-sounding term causes a lot of confusion: edge computing. Edge computing is also in
the context of CDNs. But what does edge computing mean? Traditionally, CDNs have always been a layer for serving static content: JavaScript, images, videos and so on. A user from Tokyo requests a particular video, or a frontend application's bundle of JavaScript, CSS and HTML; the CDN gets that request, immediately finds the file and sends it back. There is no processing layer there, no complex computation: you request an exact file by name and the CDN returns it without thinking about it at all. Edge computing means that when you send a request, some additional processing happens at the CDN layer, at the edge node, before it sends back whatever response you are expecting. That processing at the edge node layer is what we call edge computing.

Our traditional servers live in primary data centers, like us-east-1 and the other big regions available on AWS or GCP. The number of edge nodes, CDN nodes, is and always will be greater than the number of primary data centers, because primary data centers require enormous resources, a lot of geographic strategy, and a lot of investment, both financial and operational. CDN nodes, on the other hand, are placed and created in collaboration with ISPs, so ISPs are the primary infrastructure for hosting CDN nodes. Even though big players like Cloudflare and Akamai have their own CDN infrastructure alongside these ISP partnerships, traditionally speaking CDNs have been built primarily on top of ISP infrastructure, your internet service providers. And since there are far more of these edge nodes and they are very close to users, when you send a request to an edge node the round trip happens very quickly, even though the processing itself might take the same amount of time that our primary data
center, or primary server, would take. But since the request-response round-trip latency is lower, edge computing responses feel faster than traditional server-based responses, and for good reason.

One solid example, one of the most common use cases of edge computing, is authentication. Take the same example of a user in Tokyo and our primary data center in the US. With stateful authentication, we send whatever session information we have via cookies, so the user's request carries a cookie to the server. The server takes the session ID from the cookie and checks against the database or Redis whether the session is valid. If it is valid, the request passes through; if not, the server sends an error response, something like a 401 Unauthorized. But that whole round trip spent around 100 milliseconds just to get an error response saying you are not authorized to make this request, which seems like a big waste of bandwidth and server resources just to deliver a rejection. So the alternative that has emerged for this use case is: instead of sending the request straight to your primary server, you send it to your edge node, a CDN node capable of doing some processing. The same logic you would have run in your server layer, you run at the edge layer: check the session information, check the session ID, and if it is not valid, instantly send a 401. Now instead of getting a 401 after 100 milliseconds, you get it in 2 to 3 milliseconds, and your server is never disturbed by all those unauthorized requests. Only if the user is authorized do we let the request pass through to the origin, so the server deals only with legitimate requests, which in turn cuts out a lot of unnecessary traffic. So that is a very good and well-known use case of edge computing.
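A toy sketch of that session check at the edge. `VALID_SESSIONS` stands in for whatever session store an edge runtime can reach quickly (for example a replicated key-value store); the names and request shape are illustrative, not a real edge-platform API:

```python
# Stand-in for a fast, edge-local session store.
VALID_SESSIONS = {"sess-abc123"}


def forward_to_origin(request: dict) -> dict:
    """Placeholder for the expensive hop to the primary data center."""
    return {"status": 200, "body": "handled by origin"}


def edge_middleware(request: dict) -> dict:
    """Runs at the edge node: reject invalid sessions in ~2-3 ms so the
    origin only ever sees authenticated requests."""
    session_id = request.get("cookies", {}).get("session_id")
    if session_id not in VALID_SESSIONS:
        # the origin is never contacted for this request
        return {"status": 401, "body": "unauthorized"}
    return forward_to_origin(request)


print(edge_middleware({"cookies": {"session_id": "bogus"}})["status"])        # prints 401
print(edge_middleware({"cookies": {"session_id": "sess-abc123"}})["status"])  # prints 200
```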
computing. They can also do a lot of other things. For example, let's say they get a request and depending on user location since the edge nodes they are very close to the users they know which region they are serving. So the request coming from Tokyo they can check the headers and if the request says that the user's browser language is something like Japanese. So instead of sending the content let's say it's a blog instead of sending it in English or instead of doing the localization based configuration of your website in English they can directly send a response. This
is the Japanese version of your website, and all of this happens very fast. So user-location-based configuration, customization, and user preferences are another solid use case for edge computing. But one question you might have is: if edge computing is so fast and so useful, then why not use it for everything? The reason is constraints. What kind of constraints are we talking about? As I've already said, edge computing and CDN nodes have traditionally lived on ISP infrastructure.
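As a rough illustration of the two edge use cases just described, rejecting invalid sessions early and serving a localized variant, here is a minimal sketch in plain JavaScript. The request shape, the `validSessions` set, and the response objects are all invented for the example; a real Cloudflare Worker would use its `fetch` handler with the standard `Request` and `Response` types instead.

```javascript
// Minimal sketch of edge-side request screening (hypothetical shapes, not the
// real Cloudflare Workers API). validSessions stands in for a KV/session lookup.
function handleAtEdge(request, validSessions) {
  // 1. Session check: reject unauthorized requests in a few milliseconds
  //    instead of letting them travel ~100 ms to the origin server.
  const sessionId = request.cookies['session_id'];
  if (!sessionId || !validSessions.has(sessionId)) {
    return { status: 401, body: 'Unauthorized' };
  }

  // 2. Locale check: serve the Japanese variant directly if the browser asks for it.
  const lang = (request.headers['accept-language'] || '').toLowerCase();
  if (lang.startsWith('ja')) {
    return { status: 200, body: 'japanese-version' };
  }

  // Otherwise, let the request continue on to the primary server.
  return { status: 'pass-through' };
}
```

The point of the sketch is only the control flow: the cheap checks run at the edge, and the origin server only ever sees requests that survive them.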
So you can imagine a request from a user in Tokyo: the first point of contact is going to be the ISP, the company which provides you with internet. If you're from India, a company like Airtel or ACT or whatever ISP you're using gets your request, and they have a number of routing layers, hops as you can call them. The request goes from one point to another, then another, eventually crosses the undersea cables into another country, where another ISP picks it up, and eventually it reaches your server. Right? So all this routing happens along the way. Traditionally, an ISP has infrastructure at these points to route internet requests, and at that infrastructure they'll have some collaboration with a company like Cloudflare to provide CDN-level functionality there, but their primary responsibility has always been routing internet requests. The CDN is just something they do in collaboration
with providers like Cloudflare and AWS. So the kind of infrastructure they provide is not very robust, which means limited resources: instead of 8 GB or 16 GB of RAM, they'll have something like 1 GB of RAM or a one-core CPU. Right? So resource constraints are a big problem when we are talking about edge computing. That is one point. The second point is that the whole idea of edge computing has always been very low-latency computing. One primary example of edge computing is Cloudflare Workers, one of the most famous edge computing infrastructures available today. Cloudflare Workers use V8 isolates, the JavaScript engine from Chrome, as the runtime to process your requests and do any kind of computation, and this edge computing environment comes with a lot of constraints: it cannot interact with your file system, it cannot do things like opening raw TCP connections, and so on. So all these
resource constraints and runtime constraints are the reason we cannot offload everything to CDN nodes. Our primary servers, robust data center infrastructure, have to be there, but we can offload some strategic logic that makes sense at the CDN level, at the edge, so that the edge nodes and our primary servers work together to provide a seamless experience to our users. The things we already discussed, like authentication, user customization, validation, or routing decisions about which server a request should go to, they are very good at, but they cannot completely replace our primary servers as of now. Moving on to our next topic, which is asynchronous processing. Whenever we talk about reducing the latency of our backend application, or at least the perceived latency, the kind of latency the user faces while
interacting with our application on the front end, asynchronous processing is typically one of the solutions we opt for, and it is not something we implement only at a later stage. For example, horizontal scaling is something we go for at a particular threshold, say after crossing 10,000 or 50,000 users, when vertical scaling no longer makes sense. But asynchronous processing is one of those solutions we typically adopt from the start of our application, because of the very obvious benefits that we are going to talk about. What do we mean by asynchronous processing? Typically, our users interact with our server using a browser. For a normal HTTP API call, they send a request, we do some amount of processing, and we send a response back. For example, say the user is on their profile page and updates their name from A to B. We take that and run a DB query against the database, something like UPDATE users SET name = 'B' WHERE id = ... RETURNING id. Only after that database operation succeeds do we send a response back. We cannot tell the user that their username has been updated before the database operation is successful. Right? Only after the database operation succeeds can we tell the user that this is a 200 response, a successful response, which means that when the user refreshes the page, or if the profile page reloads automatically, they immediately see the change, their name going from A to B. So this is the typical experience in most of our operations. But there are some operations which do not have to be synchronous, in the sense of synchronous behavior. I'm not talking about synchronous processing but
synchronous behavior. The users do not have to see the change immediately after the response. Let's take an example. In most SaaS applications there is a functionality where you have your workspace or your team. If you have used software like Jira or Notion, pretty much all SaaS applications these days offer this functionality of having your own workspace and team, and you can add more members of your team to your workspace so that you can all work together and have access to the same resources. So usually in the front end there is some UI where you go and type an email, say you want to invite user1@gmail.com, and you hit enter or click Invite, whatever the button is, to submit this request and invite the user. The first thing that happens is that the browser JavaScript converts it into an HTTP API call, and your server gets it. In your server, the first layer is the routing layer, then your handler gets the request body and finds the email you want to invite, user1@gmail.com. So what does this processing look like? There are the obvious layers of validating the request and authenticating the user, but we'll ignore all that and focus purely on the service layer, the business logic, and the database layer. First, using our database, we check whether this user is already part of the team, just as a database-level validation layer. Once that passes, the next thing we do is save a database entry in a table, something like an invites table, team members table, or organization users table, depending on how you structure your database. In some table you add this data: the particular email the user has invited and its status. There will be other fields, but these are the primary ones. The status is pending by default; the user has not accepted the invite yet. Then it becomes accepted or rejected, whatever logic you have. Okay. Once that is done, the next phase typically involves sending an email to user1@gmail.com, because that's how they'll get the invite and be able to come to the platform and accept it. But for that, we have to send an email. Now, all of this processing on your backend server is very fast, because you are doing your own processing and interacting with your database, which we can assume is close to your server in terms of data center region. So this typically takes somewhere around 50 to 80 milliseconds, or 100 milliseconds at most. After doing all this, we have to send the email. To send an email, we usually have to make an API call to an external email provider, something like Mailchimp, SendGrid, or Resend, whatever provider you're using. They all pretty much do the same thing: they take the email address and either a template ID or a ready-made HTML template, you make an API call, and they send a successful response back,
but since we have no control over, and no idea about, where this mail server is or what their infrastructure looks like, we can assume this external API call usually takes somewhere around 200 to 300 milliseconds. Our own processing was around 100 milliseconds, and on top of that we added another 300 milliseconds for this external API call. Only after we get the response from the external email provider, saying the email request has been processed, do we send a response back to the user. So the user enters the email, clicks the invite button, sees a loader for 400 milliseconds, and then sees a tick mark or a toast message: you have successfully invited this user. We refresh the invited users list, and they can see this user has not accepted the invite and is in a pending state. But this whole interaction took 400 milliseconds. If you think about it, though, after doing our own processing, after adding the entry in our database, which took around 100 milliseconds, what we can do, instead of making the HTTP call to our email provider to send the email, is send the response back right away. It's a 200 response, and the user will see a tick mark that the user has been successfully invited and see in the UI that it is pending. But now this whole user interaction has come down from 400 milliseconds to 100 milliseconds, which is a very snappy experience. Even though 400 is itself not that slow, you can assume there are other API calls, or maybe the email provider is down, or whatever
reason. The point is that our goal was to reduce the perceived latency for the user interacting with our system, and that has come down significantly. On the server level, how do we handle the email before we send the successful response back? Usually, to implement asynchronous processing, we take the request of sending the email and we call it a task, or a job. If you have watched the background job processing video, you would understand what I'm talking about; that video came before this one, so I'm assuming you already have some knowledge about message queues and background job queues. Typically, we take the task of sending the email, the job, and we push it into a queue. This queue can be something like a Redis-backed queue or RabbitMQ; these are the most popular options. There are also others, like Kafka for event-based processing, lots of options, but usually a Redis-based or RabbitMQ-based queue library will do the job. Our server we call the producer, because it produces the task, the job, any kind of job, and it pushes the task into the queue, a Redis- or RabbitMQ-based queue. On the other side of the queue there is a consumer. Consumers are also called workers. Now there is a distinction here: the consumer or worker can either live in the same codebase that runs your server, which means the server just spins up a new worker from the same code, or it can be a separate codebase entirely, so that when you have a lot of traffic you can also horizontally scale your workers. That is an infrastructure choice people usually make when scaling their infra. These workers, these consumers, pick a task off the queue when they detect one, and they perform the job, whatever handler logic we have: we take the email and make the API call to SendGrid, Resend, or whatever email provider we use, and that's done. But our users do not need to know any of that, because sending an email, inviting a user to the platform, is not the kind of operation whose result the user needs to see instantly; the user will get the email anyway, eventually come to the platform, and accept or reject the invite. So this is not an operation with synchronous behavior.
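To make the contrast concrete, here is a small sketch of the two invite handlers. The queue is just an in-memory array drained by a `drainQueue` worker loop, standing in for Redis or RabbitMQ, and the email "send" is simulated; the names `invites`, `sentEmails`, and the job shape are all made up for this illustration.

```javascript
const invites = [];      // stands in for the invites table
const queue = [];        // stands in for a Redis/RabbitMQ queue
const sentEmails = [];   // records calls to the "email provider"

function sendInviteEmail(email) {
  // Simulates the ~300 ms external API call to SendGrid/Resend/etc.
  sentEmails.push(email);
}

// Synchronous version: the user also waits for the external email call (~400 ms total).
function inviteUserSync(email) {
  invites.push({ email, status: 'pending' }); // our own ~100 ms of work
  sendInviteEmail(email);                     // +300 ms the user waits through
  return { status: 200 };
}

// Asynchronous version: respond after our own work (~100 ms), queue the rest.
function inviteUserAsync(email) {
  invites.push({ email, status: 'pending' });
  queue.push({ type: 'send-invite-email', email }); // producer side
  return { status: 200 };                           // snappy response
}

// Consumer/worker side: picks tasks off the queue and runs them later.
function drainQueue() {
  while (queue.length > 0) {
    const job = queue.shift();
    if (job.type === 'send-invite-email') sendInviteEmail(job.email);
  }
}
```

Both versions leave the same data behind; the only difference is when the slow external call happens relative to the user's response.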
This is one of the very best and most famous examples of when we opt for asynchronous processing in our backends: sending emails and sending notifications. Users don't expect that the moment they perform some action, the notification will be there in the next millisecond. That is a standard user experience. In the same way, another good example is video processing. Let's say you have a video platform; let's talk about YouTube. When you upload a video to YouTube, the interaction is usually that you pick the file, you click upload, and the upload starts. While the upload is happening, you have to keep the tab open, for obvious reasons: the browser is reading bytes from your local file system and uploading them to YouTube's servers. But after the upload is finished, you don't have to keep the tab open; you can close it. In the background, after the upload is done, YouTube pushes a number of tasks into a queue, whatever queue logic they have, most likely a very robust one. You can imagine that after the upload job completes, it pushes tasks like generate thumbnails, encode the video into HD, or generate subtitles. All these tasks run maybe in parallel, maybe in sequence, whatever logic they have, but the point is that they are all done asynchronously. You as a user do not have to wait, and you do not expect to see all this within a particular time frame. Sometimes all these tasks finish in, say, 10 minutes; sometimes it might take 20 minutes, depending on the server load at YouTube. The point is that these are asynchronous processes that we as users expect will take time. So there is a whole category of operations you can safely offload to an asynchronous process: video uploads, image uploads, image resizing, and the email sending we already talked about.
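The YouTube-style fan-out can be sketched with the same in-memory queue idea: once the upload completes, the handler simply enqueues the follow-up tasks and returns, and workers pick them up whenever they get to them. The task names and queue shape are invented for illustration, not YouTube's actual pipeline.

```javascript
const queue = [];

// Called once the upload itself has finished; the user can close the tab now.
function onUploadComplete(videoId) {
  // Fan out the slow follow-up work as independent background tasks.
  const tasks = ['generate-thumbnails', 'encode-hd', 'generate-subtitles'];
  for (const type of tasks) {
    queue.push({ type, videoId });
  }
  return { status: 200 }; // respond immediately; processing happens later
}
```

Whether workers run these three jobs in parallel or in sequence is then purely a consumer-side decision; the producer does not care.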
Another thing you might have noticed is that some SaaS platforms have a delete account option. What does delete account mean? You've been using a particular SaaS for, say, one, two, or three months, and then you decide, for some fault on their side or simply because you don't need it anymore, that you want to delete your account and never come back, or maybe come back after a year with a different account. The point is, you want to delete your account, and deleting an account usually means deleting all your data on that platform. Say it is a to-do application: deleting your account means deleting all your to-dos first, obviously; then all your categories of to-dos; then your profile, whatever profile information you have, your profile picture, your bio, and so on; and all your calendar schedules. Basically, we have a lot of database tables, and many of them hold data related to this particular user. So imagine you are the backend engineer of this to-do application and a user clicks delete account. What do you do in a very naive approach? The browser makes the delete account request; it's a DELETE HTTP API call. You then run a number of database queries in a transaction, because either you want all of them deleted or none, which is exactly what a database transaction gives you. First you delete all the categories the user has created, then all their calendar schedules, then all their to-dos, then the user's profile data, and at last the entry in the users table, which comes last because all the other tables have foreign keys referencing it. Finally, after running delete queries on, let's say, seven or eight database tables, you send a response back, something like a 200, and the user is logged out from the browser. But assume this is a big platform, or the user has, say, a million tasks created because they've been using it for the last 10 years, and the same goes for the other tables; each table has a million rows to delete. Obviously, running a delete operation, finding all the user's data and then deleting it, takes time. Say each table's deletion takes around 500 milliseconds. If you deleted data from eight tables, that's around 4 seconds, and adding some business-logic processing and
the request latency, etc., the total round-trip time of this request-response life cycle was around 8 seconds. So after clicking delete account, the user saw a spinner for 8 seconds. Eight seconds is a very bad user experience no matter what the operation is, unless you have clearly specified in the dialog: please do not close the tab, please wait at least 10 or 20 seconds, which again is not something you can expect from your users. So in this case too, instead of doing everything synchronously: the user makes a request to delete their account, your server does some very basic processing, checking that the user exists and is authenticated, which takes somewhere around 50 to 80 milliseconds, and after that the server immediately sends a response back saying your account is deleted. The browser sees the successful response and logs the user out. Now the user only had to see the spinner for a moment, maybe 100 milliseconds, and they've been logged out. But in the background, after all your sanity checks, checking that the user exists and so on, you again push a task, call it delete user, with the user's ID, into the queue, and later the consumer picks up the task and runs all these database queries one by one. It does not matter whether that takes 5 seconds, 10 seconds, or 30 seconds, because your user is not waiting for the response; it all happens in the background after the user has already been sent a response. This is typically what we mean by asynchronous processing.
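A sketch of the delete-account flow under the same in-memory queue idea. The "tables" here are plain arrays of rows keyed by `userId`, and the worker deletes the user's rows from each of them; in a real backend these would be DELETE queries, ideally inside one transaction, run by a queue consumer. All names are invented for the example.

```javascript
const queue = [];
// Toy stand-ins for the real tables, each holding rows with a userId.
const tables = {
  categories: [{ userId: 1 }, { userId: 2 }],
  todos: [{ userId: 1 }, { userId: 1 }, { userId: 2 }],
  profiles: [{ userId: 1 }, { userId: 2 }],
  users: [{ userId: 1 }, { userId: 2 }],
};

// Handler: sanity checks only, then respond fast and defer the heavy deletes.
function deleteAccountHandler(userId) {
  const exists = tables.users.some((u) => u.userId === userId);
  if (!exists) return { status: 404 };
  queue.push({ type: 'delete-user', userId });
  return { status: 200 }; // user sees the spinner for ~100 ms, then is logged out
}

// Worker: runs later; it does not matter if this takes 5 or 30 seconds.
function drainQueue() {
  while (queue.length > 0) {
    const job = queue.shift();
    if (job.type === 'delete-user') {
      for (const name of Object.keys(tables)) {
        tables[name] = tables[name].filter((row) => row.userId !== job.userId);
      }
    }
  }
}
```

The ordering concern from the naive version (children before the users row, because of foreign keys) still applies inside the worker; it just no longer blocks the user's response.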
It is one of the very easy solutions when you are trying to scale your application or reduce its latency. You just create a Redis instance, mostly a managed Redis, something like Upstash, which is a very good managed Redis offering, and use one of the libraries built for this: if you're using Node.js, BullMQ is a very famous background job library. It uses Redis behind the scenes, but it manages a lot of things for you, like error handling, rate limiting, and everything else you expect from a production-grade, robust queue. So you can use something like that and easily implement a background job, queue-based solution for your asynchronous tasks. Of course, the first step is to identify exactly which tasks you can offload to your queue. As I already said, things like sending emails, sending notifications, deleting user data, video uploads, image uploads, all the tasks where the user does not expect to see the result immediately, you can offload to a queue, and that in turn significantly improves your perceived latency. A very easy and very recommended solution when you are trying to increase the performance of your application and reduce its latency. Now, moving on to another important topic, which is microservices. A very fancy and very trendy term. If you are a backend engineer, or even a front-end engineer, at some point you will keep hearing this term, microservices, because
apparently every company these days goes for a microservice architecture. Whenever you ask around about how to scale your backend application, someone will eventually suggest that you should divide your application into multiple independent services, also called microservices, as if that is the only way to scale. Or maybe one day you'll join a company that already uses a microservice-based architecture. So understanding what microservices are and when to use them is important. Of course, actually implementing microservices is a very different topic in itself, involving a lot of technologies and methodologies; we'll not get into that, but we'll talk briefly about microservices and the role they play in scaling your backend. Let's start with what we have right now, which we call a monolith. What is a monolith? A monolith is an application, and whenever I say application here I mean a backend application, because that's what we are talking about. A monolith is an application that ships as a single deployable unit. All functionality, whether it is authentication, order processing, notifications, payments, or webhooks, lives as different files in the same codebase. The modules interact with each other and function together, and every time you make a change, you take all of it and deploy it as one application that serves your users. Right? One single deployable unit, running as a single process, or if you're horizontally scaling, as multiple processes, but each process contains all of these things together; they are just different modules in the same service, the same application. Monoliths are pretty intuitive and easy to manage, because they are simple to develop, simple to test, and simple to deploy. All your code lives in one place, one codebase, one GitHub repository. Whenever you refactor, you make a change in one module, go to another module, make a change there, and deploy all of it together. So refactoring is straightforward, testing is straightforward because everything runs in one process, and deployment is simple from the start. Now the question is: why would you consider anything else? Why would you even think about microservices? You can just take a monolith and horizontally scale it whenever you want
to scale your application. You always have the option of horizontally scaling your monolith. So why do you need microservices? Microservices are primarily not about scaling the performance of your application; they are more about scaling your team, about scaling human coordination. It's not really about scaling your machines' performance, it's about scaling your team's performance, the humans behind the machines. Because monoliths have a couple of problems if you are working in a large team, and a large team is the number one prerequisite for a microservice-based architecture. "Large" can be subjective, but we are talking about more than 100 or 200 developers working on the same application; that's the point when you start considering a microservice architecture. So what are the problems with monoliths? One obvious one is deployment dependency. What do we mean by that? Let's say we have different modules in our application. We
have our payment module, we have our order processing, imagine an e-commerce application, and we have our notification module. The engineers working on the payments module implemented some critical change or some important user-experience change, and they want to deploy it so that users can start using it. But meanwhile, there are changes from the notification team in the main branch, typically the branch we deploy to production, which are not ready yet; not broken, but not ready to ship to users. Of course, you can say there are solutions like feature flags, or planning your release life cycle so that only ready-to-deploy work reaches your main branch. But in a very abstract, oversimplified way, you can imagine that you have changes in one module, and since it is a monolith and you have to deploy everything as a single unit, sometimes you will face difficulties. Again, as I said, you cannot relate to these problems unless you work in a very large team. The communication between different teams is not as close there, whereas typically when we work in startups, 10, 15, or 30 developers, pretty much everybody knows what everybody else is working on; the team works in a closely coupled way, so we are able to plan our git life cycle, our feature flags, and everything else so that this particular problem does not happen. If something is on the main branch, we can go ahead and deploy it. But if it is a very large team of
500 developers or a thousand developers, you will start seeing this problem: rapid development happening in one module while another module is not ready yet. So that is the organizational problem. The second one is scaling, which is a more obvious one. Take the same three modules: our notification module, let's say, is not resource-heavy; we just run some database queries and insert some rows, because it is in-app notification, maybe with some WebSocket-based delivery. But our order processing logic and our payment logic require a lot of resources, a lot of CPU, a lot of memory, basically a lot of hardware, if we get a lot of user traffic or are expecting it. If we want to be prepared, to the point that we cannot afford any payment-related or order-related failures, then we have to scale our backend, whether vertically or horizontally. But the problem is that when we scale a monolith, we scale the whole thing. We cannot say that since notifications don't use much resource, we only want to scale payments. We cannot scale individual modules. So that is another use case for microservices: since you have already divided your application into different services, maybe different codebases, maybe not, but different deployable units, you can scale each service according to its expected or current traffic, independently of the other modules. The third thing is the tech
stack. Let's say you have a blog platform, something like Medium; you are an engineer at a platform like Medium, Hashnode, or DEV, all these famous blogging platforms. In your backend, there are parts where you have to deal with markdown, since it is a blogging application: markdown cleaning, maybe some website parsing, all that stuff. For this you'd probably go with a Node.js app or a Python app, because in the Node.js and Python ecosystems there are thousands of libraries for pretty much every use case. So you find that for a particular operation, say markdown parsing or markdown rendering, there is a very famous library. This part of your backend, the one dealing with content, especially markdown content, absolutely needs a Node.js environment because it has to use a particular npm package. But in a different part of the same backend, let's say you take the images that users upload and do some image manipulation, resizing, or other image operations, and you find that a language like Go or Rust makes this whole image manipulation very fast, since these languages are very efficient at CPU-bound tasks. If you use something like Python or Node.js for this, the latency might be something like 500 milliseconds, while with Go or Rust it becomes 50 milliseconds; obviously the performance benefits are huge. But now, since one module needs the Node.js ecosystem, because it has to use an npm package or a Python library, and another module is CPU-bound and needs the raw performance of a systems programming language, you face a dilemma: since all your modules form a single deployable unit, you cannot use different programming languages. If you have microservices, you can just divide this
into one service that uses Node.js and another service that uses Rust, and deploy them separately. That is another advantage. So these are a couple of very obvious advantages of microservices, not just for raw machine performance but mostly for organizational performance. Now the question is: if microservices have all these performance and organizational wins, why don't we use microservices as the default architecture? As we already know, every solution comes with its trade-offs, and microservices come with a lot of trade-offs, disadvantages, and complexities too. The first one is the network. Since you have divided your application into different services, what was previously just different modules, all the code working together and deployed as a single process, is now different services with physical distance between them. What was previously a pure function call has now become a network call. Whether it is a gRPC call or an HTTP call, it is still a network call, so latency will start showing up depending on how you organize your services. And latency is not the only problem: the moment we talk about networks, there is obviously the question of failures. An HTTP call might fail, right? So you have to start worrying about how to handle failures, how to retry, and how to configure timeouts.
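As a tiny illustration of the retry part, here is a generic wrapper. It is written synchronously only to keep the sketch simple; a real inter-service call would be async and would also race each attempt against a timeout (for example with an AbortController) and add backoff between attempts.

```javascript
// Retries fn up to `retries` times, rethrowing the last error if all attempts fail.
// In a real system you would add backoff between attempts and a per-attempt timeout.
function callWithRetry(fn, retries = 3) {
  let lastError;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return fn(attempt); // success on any attempt returns immediately
    } catch (err) {
      lastError = err;    // remember the failure and try again
    }
  }
  throw lastError;        // all attempts exhausted
}
```

You would wrap, say, the order service's call to the payment service in something like this, so that one transient network blip does not fail the whole user request.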
Dealing with the network is a huge complexity that arrives the moment you go for a microservice-based architecture. Second, microservices are primarily a distributed-systems architecture, and when you're trying to understand a request, let's say you're debugging a particular one, you find at the entry point that the request went through your load balancer, then your routing layer, then your order service, then your payment service, then your notification service, and so on. So to debug a single request, you may have to check the logs of four different services at the same time just to understand what happened. Debugging becomes complex: you have to adopt the distributed tracing tools we have available, and you also have to architect your applications in a way that makes it easy to follow a single request across different services. The third problem is data consistency.
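One common building block for that kind of cross-service debugging, closely related to distributed tracing, is a correlation ID: the first service mints a request ID, every downstream call forwards it, and then all the log lines for one user request can be joined by the same ID across services. A minimal sketch, with a made-up header name and ID generator:

```javascript
// Reuse the incoming x-request-id if a previous service already set it,
// otherwise mint a new one; every downstream call forwards the same header.
function withCorrelationId(incomingHeaders, generateId = () => `req-${Date.now()}`) {
  const requestId = incomingHeaders['x-request-id'] || generateId();
  return {
    requestId,
    // Headers to attach to any outgoing call to the next service.
    outgoingHeaders: { ...incomingHeaders, 'x-request-id': requestId },
    // Log lines tagged with the ID can later be searched across all services.
    log: (msg) => `[${requestId}] ${msg}`,
  };
}
```

Real tracing systems (the kind the distributed tracing tools mentioned above implement) go much further, with spans and timing, but the forwarding idea is the same.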
about very complex micros service based architecture, each service eventually holds an instance of its own database and as we have already talked about when we are talking about scaling your database or different different databases number one concern is consistency. How do we make sure that change in one database even though we have features like replication etc etc but still there is some amount of replication lag that comes the moment we talk about different different database instances. So that becomes another problem you have to start worrying about
replication lag or data consistency across different different databases and there are other disadvantages of microservices that you can read more about. But the question is when do you decide or how do you decide that microservices are a very good solution for you? As I have already said large teams are one of the mandatory requirements before you even consider a micros service-based architecture because large teams intuitively have very clear boundaries and microser architecture itself has very clear boundaries so that human organization
and this technical organization aligns very well. The second obvious criteria is different scaling needs. As we talked about, if one service needs to scale independently, then something like microservices makes a lot of sense. Then technology, if you need to use different different technology for different different parts of your back end, then also microser make a lot of sense. But all these complexities that we talked about in microservices, they outweigh the advantages. So going for a micros service-based architecture is a very big decision. And unless you have clear
answers for all these questions, whether you have a large team or not, whether you have very specific technology requirements across your different different parts of your back end and your deployment velocity, etc., etc., going for a microser based architecture is typically not worth it. And the last thing that I want to talk about in the context of performance and scaling is serverless which has in modern times become the go-to way of scaling any kind of applications especially backend
applications with the advent of platforms like Vercel, Netlify, and Cloudflare, which have popularized serverless computing to a large extent. I have created a separate playlist where I talk solely about serverless computing: its origin, how it all started, how platforms like Vercel work, how serverless databases work, and everything related to serverless. If you are curious about how serverless computing works, you can of course give that playlist a watch. But here we are going to build the intuition for serverless, since we are already talking about scaling and serverless is such an important part of that process.

Now, before we talk about serverless computing, we need to understand what came before it. When you run a backend application, it runs on a server, meaning some kind of VM in, let's say, EC2 or whatever cloud you are using, with an operating system like Ubuntu, some kind of Unix distribution. So traditionally, you take a VM from a cloud provider, which comes with an operating system like Ubuntu, and you provision your server there. You install whatever software you need, like NGINX, or Docker for containerizing your application, whatever stack you have for deploying. You configure the server so that it can fetch your code, build your code, and run your code. You configure that particular server with your application, and then you keep managing it: whether the configuration breaks, whether the DNS is working, all of that. You are responsible for that server for as long as you are using it.

This model of deploying our applications and provisioning our servers has been working for decades. You know exactly what you have: a particular VM or machine with two or four cores of CPU, with 4, 16, or 32 GB of RAM, with 30 GB of SSD or hard-drive capacity, and with whatever network capacity your cloud provider gives you, something like 1 TB a month. Your application runs on this particular machine within these limits: 4 GB of RAM, a two-core CPU, 30 GB of disk space, one terabyte of bandwidth per month. You know exactly what you are getting, the exact machine, and your
application starts serving, starts processing requests as they arrive. Now, the first challenge with this kind of model is capacity planning. What do we mean by capacity planning? It means you need to decide beforehand, before you even start serving users, how many servers you need (if you are going for horizontal scaling) and, for each server, what the capacity of the machine should be: how much RAM, how many CPU cores, how much disk. All of this you have to decide up front, which means you have to predict your user traffic, predict your users' behavior, before you even deploy your application. Why? Because if you provision too little, let's say only 4 GB of RAM for your server, and a huge traffic spike happens because of some online campaign or a blog post, your server cannot process that many requests per second and it starts crashing. Either that, or your server becomes very unresponsive and every request takes 3 to 4 seconds to get a response back. Your users either start experiencing slowness or start seeing a lot of errors, a lot of error toasts in the frontend, and because of that they start quitting your application. If they are new users, you lost the chance for them to become paid users. If they are already paid users, they might be motivated to quit your platform and look for alternatives. The point is, you start experiencing losses, financial or reputational, because you could not predict your user traffic and could not plan your server capacity accordingly.

The other side of that is provisioning too much. Let's say you did not want to risk any of this, so instead of 4 GB of RAM you went for 32 GB. The traffic spike happened, your server received the requests, and it sent all the responses back. But since you overprovisioned, configuring far more capacity than you needed (8 GB of RAM would have done the job perfectly for that traffic), you had to pay for this machine for as long as it was running. Say you ran it for one week and it cost you around $5,000 at 32 GB of capacity, even though you only used 20% of it. Had you configured 8 GB of RAM, it would have cost you around $500; instead it cost you $5,000. Why? Because you could not predict your user traffic. So it does not matter whether you underprovision or overprovision: in both cases you suffer some amount of loss, because predicting is obviously
impossible when it comes to user behavior. There is an obvious solution to this, and it has been around for a long time: autoscaling. What is autoscaling? It scales based on resource usage. Let's say you initially provisioned only two servers with 4 GB of RAM each, and a lot of users start coming in, so the resource usage of those two servers starts climbing. An autoscaling configuration will detect that the servers have crossed, say, 70% of their memory usage, and it will spin up a new instance. Now the requests get distributed across three servers, and the memory usage of each comes down to around 40%. So autoscaling definitely addresses our underprovisioning and overprovisioning problem, but it comes with a couple of drawbacks, a couple of trade-offs, of its own.

Number one is time. When your autoscaling infrastructure spins up a new server instance, it is provisioning a new server, and that includes, first, booting a new operating system, most likely some Linux distribution such as Ubuntu. Second, it includes configuring your application: fetching or building your code, starting your application as a process, then exposing that process to the internal network, to your load balancer, and so on. Depending on your operating system, your programming language, and your build time, all of this can take anywhere from a few seconds to a few minutes. So if a lot of users start sending requests within a very small window, say 3 to 4 seconds, your autoscaling infrastructure might not have enough time to scale, simply because of how long it takes to boot a new server instance.

Second, scaling has its limits. Whenever you configure autoscaling, there are two fields you have to set: the minimum number of instances and the maximum. If you set the maximum too low, you are back to the underprovisioning problem: say you cap your autoscaling at 10 instances; if you get even more user traffic, users start experiencing the same unresponsiveness and errors. If you set the maximum too high, say a thousand, in the sense that you do not care how many instances you spin up as long as you serve all your traffic, then if some kind of attack happens, or even genuine traffic like a Black Friday sale, your instance count might actually climb to 500 or 600 and you incur a charge of something like $100,000 in a day, blowing past your budget. That is a completely different problem.

Third is a very serious one: the reactiveness of autoscaling. With our two instances of 4 GB each, the autoscaling infrastructure only starts spinning up new instances once it sees that memory has crossed the 70% threshold. By the time the new server instance starts, you are already overloaded, because autoscaling is reactive, not proactive. It cannot predict user traffic beforehand; it can only add instances once it knows you are under load. And if that window of time is too small, you cannot be sure you will scale in time; you might still see crashes or errors. Even with all these limitations, autoscaling remains one of the most standard practices for scaling our applications. But there is another problem
which is the always-on cost. Even with autoscaling, you have a minimum. In the context of horizontal scaling, maybe your minimum is two instances; but even with vertical scaling, thinking of a single machine, you still have a minimum, which is that machine's configuration: say 4 GB of RAM, two CPU cores, 30 GB of disk. It does not matter whether you are receiving 500 requests per second or zero: you are constantly paying for this server, for this configuration, your machine's minimum capacity. So if your application receives a constant, steady, very predictable flow of requests, this model definitely makes sense. The traditional serverful model, a server that is always online so that even a single request gets a response, makes a lot of sense for that kind of setup, which is typically most of our SaaS applications. But for setups where this does not make sense, where the volume of user requests is not very predictable, people have come up with a new model called serverless.

Simply speaking, serverless means you do not worry about servers at all. It is not that serverless does not involve servers; it definitely does, otherwise no processing could happen. We still need a machine, a VM, some kind of operating system, and we still need our application configured on that operating system so it can respond to requests. The difference between serverless and a traditional architecture is that you do not have to worry about the machines themselves. You do not have to worry about the operating system, or whether your server has 4 GB or 8 GB of RAM. Nothing. And you do not have to worry about configuring your application to work with your server. The only things you provide are your code, which in serverless terminology we call functions, and the events that trigger those functions. These are the only two things you have to worry about, the only two things you manage. The way the computing model differs from traditional computing is this: a request comes in, and there is some kind of front here; you of course do not go to your function-execution layer, your serverless layer, directly.
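To make the function-and-event idea concrete, here is a minimal sketch of what such a function can look like. The handler shape loosely follows AWS Lambda's (event, context) convention, but the event fields and names here are illustrative, not any provider's exact API:

```python
import json

def handler(event, context=None):
    """One invocation per event; the function holds no state between calls.

    The provider boots the environment, calls this function with the
    event, and may tear the machine down again afterwards.
    """
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```

We deploy only this function plus a trigger (for example, a route like GET /hello); the machine, the operating system, and the process lifecycle underneath are the provider's problem.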
Usually we have what we call an API gateway here. It accepts HTTP calls and, depending on which route we are trying to access, it triggers a particular function: it maps routes to function addresses, since as developers we only worry about functions and events. Here the events are the routes, and the functions are the code we write. So the API gateway works as a middleman that takes a particular user request. Comparing this with the traditional request-response cycle: the API gateway sits at the front, takes an HTTP request, and depending on the route, spins up a new function. And what does it mean to spin up a serverless function? Traditionally, as we were saying, we have a server with some capacity, let's say 4 GB of RAM, a two-core processor, 30 GB of disk, and it is always there, always present, always listening, ready to serve any number of requests, whether zero or 500, and we are paying for it the whole time. With serverless, we do not worry about the machine or its capacity at all. We just say: if there is a request for this route, run this particular piece of code. That is all. The provider, whether it is AWS Lambda or Cloudflare Workers or anything else, handles the machine part. When you first deploy, no machine is provisioned for you; you just push your code, and it sits there. The first time a request arrives, or an event happens that needs your code to run, your serverless provider takes the code you provided and spins up a new instance on some machine. We do not know which machine it is or how much capacity it has, only that our code is going to run on some machine and send a response back. And this happens only after we get a request. After we send the response back, depending on your provider's configuration, the instance either lives for a few more seconds, say 3, 5, or 30 seconds, or it goes into a pool, so that when the next request comes, an existing machine can serve it instead of spinning up a new one. That is the difference in the computing model.

The second difference is the pricing model. With traditional servers, we pay for our servers 24/7: as long as the server is running, we pay for it. In a serverless setup, we only pay for the interaction. The moment a request comes in and our serverless instance spins up, we pay for whatever processing we do there, the CPU time, the memory time. The moment we send the response back, we are done. So our monthly cost is only the sum of all these interactions, all the milliseconds that
the CPU was actually executing our code, which in turn saves a lot of money, because we are not paying 24/7. Now, where is the catch? We already know no solution comes without its own set of trade-offs. The number one issue with serverless computing is what we call cold-start time. The way serverless works, we have no pre-assigned machine ready to take requests; we only spin up a new instance, whatever its capacity, after the first request arrives, and that instance only lives for a few more seconds before it is given back or closed forever. The time it takes to spin up a new instance is what we call the cold-start time, and it has always been the number one problem of using serverless technologies.

There have been a lot of solutions, and people are still inventing ways to get around the cold-start problem. One famous one is keeping a few instances always warm through automated pings: you automate requests from your own infrastructure every few seconds, so that an instance comes up to serve them and an actual user request always finds an instance ready. This hack exists, but you cannot go overboard with it; otherwise you end up with a few instances of your servers always running, and you are paying for them 24/7 again, which completely defeats the purpose of serverless. The second solution, the most effective one you could say, is building runtimes with minimal cold start. The big cold-start times come down to two main causes.
One is booting a new operating system, or a new VM that comes with an operating system. The second is the language runtime: it could be a Java application, a Rust application, a Go application, a Node.js application, a Python application, whatever your language is. These are the two primary causes of cold-start time. On the first, a traditional VM under a hypervisor takes a long time to boot a new operating system. To cut this OS- and VM-related cold start, AWS came up with a new lightweight VM technology built on KVM, called Firecracker, which is what they use for Lambda, and it has become something of an industry standard for serverless architectures. The second major one is Cloudflare Workers, again one of the largest providers of serverless computing. They use V8 isolates. V8 is a JavaScript engine, which you can read more about, but V8 isolates provide an isolation level, a VM-like sandbox where your code can run, and it is extremely fast: they claim something like 0 to 1 milliseconds to boot a new V8 isolate. These kinds of technologies that keep being invented help with the operating-system side of cold-start time.

The second cause is the language. A language like JavaScript or Python is interpreted, which means there is no separate compilation step: you provide the code and it starts running line by line, with no pre-compiled phase. Because of that, these languages start much faster in a serverless environment than a language like Java, with its compilation and its JVM, and the same goes for other systems languages that have a compile phase. That is another big factor, and it is why Cloudflare Workers use V8 isolates, which avoid the OS-boot problem, together with JavaScript. Combining the two, they achieve extremely low cold-start times, somewhere around 5 milliseconds you could say. But the cold-start time will always be there. We can never beat traditional serverful compute on cold starts, because traditional servers are always online, always ready to serve our requests; with serverless computing, the cold-start problem will always exist because of the very nature of how it works.

So the first problem was cold-start time. The second is limits. Most serverless providers, like AWS Lambda, impose limits: Lambda functions, for example, can run for at most 15 minutes. So if you have a long-running call, something that takes say 30 minutes, it will fail halfway, right? Most serverless environments put some limit on how long they can keep a request alive. Third is
statelessness. Now, this is a big problem, which we discuss in depth in the serverless videos, but in short: most of our setup on a traditional server is based on the premise that the server is stateful. To interact with the database, we establish TCP connections, active connections that we hold open to the database. The same goes for WebSocket connections we establish with browsers. We have a lot of stateful behavior attached to our traditional servers. But with serverless, because of the execution limits, instances are only alive for a few seconds, and they cannot hold data inside the operating system, because the machines keep getting replaced every few seconds depending on traffic even though the code stays the same. Because of all these behaviors, a serverless architecture is primarily stateless. So all the stateful behaviors, the TCP connections, the WebSocket connections, have to be reimagined the moment you start using a serverless architecture in your backend.

With all this, when should you start using, or migrating to, a serverless architecture? It is easier to say when you should not. First, latency-sensitive applications: for user-facing, latency-critical systems like banking or payment applications, where you cannot afford any extra latency, serverless does not make a lot of sense. Then come long-running procedures: if you have long HTTP calls or WebSocket connections somewhere, serverless does not make much sense either. And for applications that need a lot of database connections, you will have to figure out a lot of missing components, for example opting for a serverless database if you make your backend serverless first.

Some of the ideal use cases of serverless are very specific operations in our workflow. A famous one is video processing, or image processing and resizing, since these are tasks or jobs that do not run regularly (unless your application purely does video processing). In a typical serverless-based setup, instead of running heavy infrastructure for video and image processing and paying for it 24/7, we just keep a serverless function, and whenever a request comes in to process a new video or image, we spin up a serverless instance, finish the task, and return a response. That is a very good use case for serverless. So is any kind of event-based operation: a new message appearing in a queue, a file upload like I just mentioned, or a database change. If you are thinking in terms of an event triggering some pipeline or some operation, serverless makes a lot of sense there too. Currently the industry is somewhat overhyped about serverless. It is definitely a powerful tool for very specific use cases but, in my opinion, not a universal replacement for servers. Still, understanding where it fits, why we came up with it, how it
works, and when you should use it is definitely a very good tool in your arsenal whenever you are thinking about scaling and performance.

So, we have covered a lot of topics in both of these videos about scaling and performance: from the fundamentals of latency and throughput, through database optimization and caching, to vertical scaling, horizontal scaling, distributed systems, the infrastructure components that come into play like API gateways and load balancers, and organizational practices around microservices. So let me summarize. Let me give you a couple of rules of thumb, a couple of mental models, that you can take away from all this information.

First: always start with the problem. All these techniques we discussed are primarily solutions, and even though we approached them from the problems they solve, the discussions were still about solutions. Before you start reaching for solutions, you should understand where you stand with your problem. Is your system actually slow, and how slow? You have to actually measure it, using practices like observability (traces, metrics, logs) and performance and load testing. Measuring is the big part that answers the question: where exactly is your system slow? Which component is the bottleneck? And you need the specific answer: if your database is the bottleneck, you should know that it is the bottleneck because a particular index is absent, or because sharding is absent. Unless you know these specific answers, you should not opt for any of the solutions we talked about; doing so is what we call premature optimization, which is not something I would advise. So always measure first. Measure everything, then profile and trace your requests. There are tools you can configure, like Prometheus and Grafana, to trace every single request and find your bottlenecks, or you can go with a paid, managed service like New Relic, which makes measuring your system a lot easier. And after doing all this, understand where the time actually goes before trying to make your system faster, because the worst thing you can do while solving performance problems is to solve the wrong problem, to fix the wrong bottleneck: it gives you the impression that you fixed your problem, which in turn becomes a much larger problem along the way. So if you want to take one thing from both of these videos, it is this advice: measure every single interaction of your system, using a tool like New Relic or a custom setup like Prometheus and Grafana, and only act after you have measured every component, only after you know all the numbers. What is your average latency? What is your average
amount of errors? Once you have all the numbers in front of you, only then go for solutions. Only after that, go for caching, go for aggressive indexing, go for whatever solutions we discussed in this video.

Second: always prefer simple solutions. What do we mean by simple solutions? One example we discussed just before this is microservice architecture. Microservices are one of those things that do have a lot of benefits, but given the complexities they come with, it is usually not worth switching; microservices are not what we would call a simple solution, and they can get very complicated unless you have experience dealing with and operating them. That is one example. In the same way, while discussing horizontal scaling: on the infrastructure side, dealing with something like Kubernetes is again not what we would consider a simple solution. So unless you have no choices left, and unless you know you definitely need that particular tool, do not go for these complicated solutions. Otherwise, always try to stick to simple solutions, because they are easy to understand, easy to debug, and easy to operate. Starting with one large server, a vertical-scaling solution, is always simpler than going with a horizontally scaled architecture from day one of development. Right? In the same way, implementing proper indexes for your database queries is a much simpler solution than placing a Redis cache in front of all of them. And a monolith is always simpler than a microservice-based architecture, because complexity has costs. It always has costs. Every component you add to a system is another component that can fail, another component you have to monitor, another component you have to understand and operate. So only accept complexity when simplicity is genuinely insufficient and you have no other choices. Of course, this does not mean you always choose the simplest possible approach; sometimes complexity is necessary. But you have to understand that complexity is a cost, and it requires enough justification before you start taking it on.

Third: scale for the problems you have. What do I mean by this? You do not need to build for a million users on your first day of development; you probably will never reach a million users. Most platforms do not make it. So build for the scale you currently have, with some headroom for growth, some reasonable buffer. And as you grow, you will learn where your bottlenecks are, as long as you have properly implemented observability, meaning your logs, metrics, and traces. Your specific application has its own specific characteristics. So generic advice from the internet, from
from engineering blogs that are from big companies Netflix or Facebook or Google they might not apply to your specific situation. So learning from measuring your system is much more valuable and a much more solid approach as compared to taking some performance engineering blog from a big company and start implementing their solution for your systems. Which brings me to the fourth problem and measure. Now I cannot stress this enough. Observability is one of
those components in your backend systems that you should have from day one. So in the first point as I said that always go for the simplest solution. So this is one of those cases where you have to make an exception. Now you might think that on your day one going for simple logs and ignoring metrics and traces are a good way until you reach like 50,000 users or one lakh users. But from my experience, measuring your systems, which means implementing proper
production grade observability from day one always pays off. You never have to face random errors. You never have to face random crashes in your system. You never have to guess where your system's bottleneck is. Everything is in front of your face all the time. And as your system grows, you have enough headroom. You get enough headsups that you can scale. You can make changes to your system. You can make changes to your infra before they cause any kind of problems because you measure every single thing. So this is very important.
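As a concrete sketch of what "observability from day one" can look like, here is a minimal Python example using only the standard library. The `JsonFormatter`, the `timed` decorator, and the `get_user` handler are illustrative names assumed for this sketch, not something from the video; in a real system you would more likely reach for an established tool such as OpenTelemetry or a Prometheus client rather than hand-rolling this.

```python
import functools
import json
import logging
import time

# Structured (JSON) logging: one machine-parseable line per event,
# so logs can later feed a metrics/alerting pipeline.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            # Extra structured fields attached via logging's `extra=` mechanism.
            **getattr(record, "fields", {}),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

def timed(fn):
    """Log how long each call takes -- a stand-in for a real metrics client."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("request handled", extra={"fields": {
                "handler": fn.__name__,
                "duration_ms": round(elapsed_ms, 2),
            }})
    return wrapper

@timed
def get_user(user_id):  # hypothetical request handler for the sketch
    return {"id": user_id, "name": "demo"}
```

Even this small amount of structure means that when traffic grows, you already have per-handler latency numbers in your logs instead of having to guess where the bottleneck is.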
Now the last thing is mindset. As we talked about in the security video, security is a mindset: you have to constantly learn about new threats and new ways to secure your application. In the same way, performance optimization and scaling is an ongoing learning process. It is a mindset, and it comes from experience. You will build your systems, you will watch them struggle, and you will optimize and see what helps and what does not. And over time, after a lot of trial and error, you can say that your system can scale, that your system has all the performance-related changes and scaling work it needs to cater to the kind of user traffic your application actually gets, not some generic number. A single tutorial cannot teach you all of that. Your job as a backend engineer is not to predict every kind of problem that can happen, but to build systems that handle problems gracefully when they do happen, and to develop the skills to measure and diagnose these problems so that you can resolve them quickly when they do occur.